5  Logistic Regression 3

Author

Lin Yu

Published

February 14, 2025

library(dplyr)
library(ggplot2)
library(DT)
library(here)

5.1 Recap of Tutorial 4

library(rms)
Loading required package: Hmisc

Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':

    src, summarize
The following objects are masked from 'package:base':

    format.pval, units
load(here("data","tutdata.RData"))
dd <- datadist(tutdata)
options(datadist = "dd")
fit1 <- lrm(y ~ blood.pressure + sex + age + rcs(cholesterol,4), data = tutdata)
fit2 <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)), data = tutdata,x=TRUE, y= TRUE)
anova(fit2)
                Wald Statistics          Response: y 

 Factor                                           Chi-Square d.f. P     
 blood.pressure                                    0.26       1   0.6105
 sex  (Factor+Higher Order Factors)               38.71       5   <.0001
  All Interactions                                26.13       4   <.0001
 age  (Factor+Higher Order Factors)               30.42       2   <.0001
  All Interactions                                 3.86       1   0.0495
 cholesterol  (Factor+Higher Order Factors)       23.78       6   0.0006
  All Interactions                                22.47       3   0.0001
  Nonlinear (Factor+Higher Order Factors)          5.33       4   0.2550
 sex * age  (Factor+Higher Order Factors)          3.86       1   0.0495
 sex * cholesterol  (Factor+Higher Order Factors) 22.47       3   0.0001
  Nonlinear                                        4.79       2   0.0911
  Nonlinear Interaction : f(A,B) vs. AB            4.79       2   0.0911
 TOTAL NONLINEAR                                   5.33       4   0.2550
 TOTAL INTERACTION                                26.13       4   <.0001
 TOTAL NONLINEAR + INTERACTION                    26.81       6   0.0002
 TOTAL                                            62.26      10   <.0001
anova(fit2)
                Wald Statistics          Response: y 

 Factor                                           Chi-Square d.f. P     
 blood.pressure                                    0.26       1   0.6105
 sex  (Factor+Higher Order Factors)               38.71       5   <.0001
  All Interactions                                26.13       4   <.0001
 age  (Factor+Higher Order Factors)               30.42       2   <.0001
  All Interactions                                 3.86       1   0.0495
 cholesterol  (Factor+Higher Order Factors)       23.78       6   0.0006
  All Interactions                                22.47       3   0.0001
  Nonlinear (Factor+Higher Order Factors)          5.33       4   0.2550
 sex * age  (Factor+Higher Order Factors)          3.86       1   0.0495
 sex * cholesterol  (Factor+Higher Order Factors) 22.47       3   0.0001
  Nonlinear                                        4.79       2   0.0911
  Nonlinear Interaction : f(A,B) vs. AB            4.79       2   0.0911
 TOTAL NONLINEAR                                   5.33       4   0.2550
 TOTAL INTERACTION                                26.13       4   <.0001
 TOTAL NONLINEAR + INTERACTION                    26.81       6   0.0002
 TOTAL                                            62.26      10   <.0001

compare two models using LRT. To use LRT, one model needs to be nested within a larger model.

lrtest(fit1,fit2)

Model 1: y ~ blood.pressure + sex + age + rcs(cholesterol, 4)
Model 2: y ~ blood.pressure + sex * (age + rcs(cholesterol, 4))

  L.R. Chisq         d.f.            P 
2.831616e+01 4.000000e+00 1.076137e-05 

5.2 Tutorial 5

set.seed(1017)
validate(fit2, B = 100)
          index.orig training   test optimism index.corrected   n
Dxy           0.2832   0.3058 0.2682   0.0376          0.2456 100
R2            0.0896   0.1059 0.0795   0.0263          0.0632 100
Intercept     0.0000   0.0000 0.0389  -0.0389          0.0389 100
Slope         1.0000   1.0000 0.8601   0.1399          0.8601 100
Emax          0.0000   0.0000 0.0401   0.0401          0.0401 100
D             0.0684   0.0817 0.0604   0.0213          0.0470 100
U            -0.0020  -0.0020 0.0019  -0.0039          0.0019 100
Q             0.0704   0.0837 0.0585   0.0252          0.0452 100
B             0.2316   0.2288 0.2342  -0.0054          0.2369 100
g             0.6230   0.6828 0.5790   0.1038          0.5192 100
gp            0.1457   0.1566 0.1357   0.0209          0.1248 100
  • index.orig: The performance metric computed on the full dataset.
  • training: The metric computed on the training dataset (bootstrap sample). - test: The metric computed on the test dataset (out-of-bootstrap sample).
  • optimism: The difference between training and test performance, estimating overfitting.
  • index.corrected: The optimism-adjusted estimate of model performance. n: Number of bootstrap iterations (100)
  • g (Gini coefficient): Related to AUC (Area Under the Curve)
  • Emax (Maximum Calibration Error): Largest difference between predicted and observed probabilities.
dff <- resid(fit2, "dffits")
plot(dff)

show.influence(which.influence(fit2), data.frame(tutdata, dffits = dff), report = c("dffits"))
    Count     sex cholesterol     dffits
143     2  female   *274.0808  1.6108318
173     4 *female   *138.5056 -1.1265158
181     2 *male      146.4978  0.8026558
408     2 *male      148.1649  0.6813596
411     1  female    146.3522  0.5594519
834     2  female   *146.6413 -0.4908278
840     2  female   *147.0025 -0.5604495

The dffits statistic measures the influence of each observation on the fitted model. It represents the change in the predicted value for a data point if that data point were removed from the model.

# Partial residuals are the residuals from the model, adjusted for the effect of the predictors that are not of primary interest
resid(fit2, "partial", pl = "loess")

  • Non-linearity: If the loess curve shows a clear non-linear relationship between a predictor and the residuals, you might need to consider transforming that predictor (e.g., using log or polynomial terms) to improve the model fit.
  • Outliers or influential points: Any points that deviate significantly from the smoothed curve might be influential or outliers, warranting further investigation.

5.3 Boostrapping

resampling technique used to estimate the sampling distribution of a statistic by repeatedly sampling with replacement from the original dataset, useful when theoretical distributions are unknown or difficult to derive.

for example, say we want to calculate the confidence interval for sample mean, we need to know the distribution of the sample mean:

set.seed(1017)

## pop data
n <- 1000
data <- rnorm(n, mean=5, sd=10) 

analytical_lower <- mean(data) - 1.96*sd(data)/sqrt(n)
analytical_upper <- mean(data) + 1.96*sd(data)/sqrt(n)


nboot <- 1000

## sample(data, size=n, replace=TRUE) bootstrap sampling
## mean(sample(data, size=n, replace=TRUE)) calculate sample mean
boot_means <- replicate(nboot, mean(sample(data, size=n, replace=TRUE)))

boot_means %>% density() %>% plot()

# percentile 
boot_lower_bound <- quantile(boot_means, 0.025)
boot_upper_bound <- quantile(boot_means, 0.975)


cat("Original Sample Mean:", mean(data), "\n")
Original Sample Mean: 4.87006 
cat("95% CI for Mean (analytical solution):", analytical_lower, analytical_upper, "\n")
95% CI for Mean (analytical solution): 4.247337 5.492783 
cat("95% Bootstrapped (nboot=1000) CI for Mean:", boot_lower_bound, boot_upper_bound, "\n")
95% Bootstrapped (nboot=1000) CI for Mean: 4.220525 5.464462 

what would happen if we increase the number of bootstrap?

nboot <- 10000

## sample(data, size=n, replace=TRUE) bootstrap sampling
## mean(sample(data, size=n, replace=TRUE)) calculate sample mean
boot_means2 <- replicate(nboot, mean(sample(data, size=n, replace=TRUE)))

# percentile 
boot_lower_bound <- quantile(boot_means2, 0.025)
boot_upper_bound <- quantile(boot_means2, 0.975)
cat("95% CI for Mean (analytical solution):", analytical_lower, analytical_upper, "\n")
95% CI for Mean (analytical solution): 4.247337 5.492783 
cat("95% Bootstrapped(boot=10000) CI for Mean:", boot_lower_bound, boot_upper_bound, "\n")
95% Bootstrapped(boot=10000) CI for Mean: 4.242179 5.499558